
where [α, β] is the clip range and b is the bit-width. The clipping range, [α, β], determines the range of real values that should be quantized. The choice of this range is crucial, as it determines the quantization’s precision and the quantized model’s overall quality; the process of choosing it is known as calibration, an important step in uniform quantization. The clipping range can be tighter

in asymmetric quantization than in symmetric quantization. This is especially important

for signals with imbalanced values, like activations after ReLU, which always have non-

negative values. In contrast, symmetric quantization simplifies the quantization function by fixing the zero point at Z = 0, which reduces the mapping to:

\[
q_x = \mathrm{INT}\!\left(\frac{x}{S}\right). \tag{2.7}
\]
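As a concrete illustration of the two schemes, the following minimal NumPy sketch quantizes a tensor asymmetrically with a calibrated clip range [α, β] and a non-zero zero point, and symmetrically with Z = 0 as in Eq. (2.7). The function names and the simple min/max calibration rule are illustrative assumptions, not a prescribed procedure.

```python
import numpy as np

def asymmetric_quantize(x, alpha, beta, b=8):
    """Uniform asymmetric quantization with clip range [alpha, beta] and bit-width b."""
    S = (beta - alpha) / (2**b - 1)          # scale factor
    Z = int(np.round(-alpha / S))            # zero point: the real value alpha maps to 0
    q = np.clip(np.round(x / S) + Z, 0, 2**b - 1)
    return q.astype(np.int32), S, Z

def symmetric_quantize(x, alpha, b=8):
    """Uniform symmetric quantization: Z = 0, clip range [-alpha, alpha] (restricted-range variant)."""
    S = alpha / (2**(b - 1) - 1)             # scale factor
    q = np.clip(np.round(x / S), -(2**(b - 1) - 1), 2**(b - 1) - 1)
    return q.astype(np.int32), S

# Calibration: choose the clip range from the observed data (min/max is one simple rule).
x = np.random.randn(1000).astype(np.float32)
q_a, S_a, Z_a = asymmetric_quantize(x, x.min(), x.max())
q_s, S_s = symmetric_quantize(x, np.abs(x).max())

# De-quantize to inspect the error introduced by rounding and clipping in each scheme.
x_hat_a = (q_a - Z_a) * S_a
x_hat_s = q_s * S_s
```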

In general, the full-range approach provides greater accuracy. Symmetric quantization is

commonly used for quantizing weights due to its simplicity and reduced computational cost

during inference. However, asymmetric quantization may be more effective for activations, because the constant offset introduced by the asymmetric quantizer can be absorbed into the layer bias or used to initialize the accumulator, as sketched below.
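To make the last point concrete, here is a minimal NumPy sketch (the per-tensor min/max quantizer and all variable names are illustrative assumptions) showing that the zero-point term of asymmetrically quantized activations depends only on constants known offline, so it can be precomputed and folded into the layer bias:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)).astype(np.float32)   # layer weights (kept in float here)
b = rng.standard_normal(4).astype(np.float32)        # layer bias
x = rng.standard_normal(8).astype(np.float32)        # input activations

# Asymmetric activation quantization: x is approximated by S_x * (q_x - Z_x).
nbits = 8
alpha, beta = float(x.min()), float(x.max())
S_x = (beta - alpha) / (2**nbits - 1)
Z_x = int(np.round(-alpha / S_x))
q_x = np.clip(np.round(x / S_x) + Z_x, 0, 2**nbits - 1)

# Reference: de-quantize the activations, then apply the layer.
y_ref = W @ (S_x * (q_x - Z_x)) + b

# Equivalent: the term -S_x * Z_x * (W @ 1) is a constant, so it can be folded into
# the bias ahead of time; only the integer activations q_x enter the matrix product.
b_folded = b - S_x * Z_x * W.sum(axis=1)
y_folded = S_x * (W @ q_x) + b_folded

assert np.allclose(y_ref, y_folded, atol=1e-4)
```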

2.2 LSQ: Learned Step Size Quantization

Fixed quantization methods that rely on user-defined settings do not guarantee optimal

network performance and may still produce suboptimal results even if they minimize quan-

tization error. An alternative approach is learning the quantization mapping by minimizing

task loss, directly improving the desired metric. However, this method is challenging because

the quantizer is discontinuous and requires an accurate approximation of its gradient, which

existing methods [43] approximate only coarsely, overlooking the effects of transitions between quantized states.

This section introduces Learned Step Size Quantization (LSQ) [61], a method for learning the quantization mapping for each layer in a deep network. LSQ improves

on previous methods with two key innovations. First, we offer a simple way to estimate

the gradient of the quantizer step size, considering the impact of transitions between quan-

tized states. This results in more refined optimization when learning the step size as a

model parameter. Second, we introduce a heuristic to balance the magnitude of step size

updates with weight updates, leading to improved convergence. Our approach can be used

to quantize both activations and weights and is compatible with existing techniques for

backpropagation and stochastic gradient descent.
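One common way to realize these two ingredients in an autograd framework is with small "pass-through" tricks, sketched below in PyTorch. This is an illustration of the idea rather than the exact implementation of [61]; the helper names are illustrative, and only the gradient-scale constant follows the 1/sqrt(N_W · Q_P) rule reported in [61], where N_W is the number of weights in the layer and Q_P the number of positive quantization levels (defined in the next subsection).

```python
import math
import torch

def round_pass(x: torch.Tensor) -> torch.Tensor:
    """Round in the forward pass, but behave like the identity in the backward pass,
    so that autograd can still propagate a gradient to the step size."""
    return (x.round() - x).detach() + x

def grad_scale(s: torch.Tensor, scale: float) -> torch.Tensor:
    """Return s unchanged in the forward pass, while multiplying the gradient that
    reaches s by `scale` in the backward pass (the balancing heuristic)."""
    return (s - s * scale).detach() + s * scale

# Example gradient scale for a weight tensor with N_W elements quantized to Q_P
# positive levels, following the 1/sqrt(N_W * Q_P) rule reported in [61].
w = torch.randn(64, 32)
Q_P = 2**(8 - 1) - 1
g = 1.0 / math.sqrt(w.numel() * Q_P)
```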

2.2.1 Notations

The goal of quantization in deep networks is to reduce the precision of the weights and the activations at inference time in order to increase computational efficiency. Given the data to quantize v, the quantizer step size s, and the number of positive and negative quantization levels (Q_P and Q_N), a quantizer is used to compute v̄, a quantized integer-scaled representation of the data, and v̂, a quantized representation of the data at the same scale as v:

\[
\bar{v} = \left\lfloor \mathrm{clip}\!\left(v/s, -Q_N, Q_P\right) \right\rceil, \tag{2.8}
\]
\[
\hat{v} = \bar{v} \times s, \tag{2.9}
\]

where ⌊·⌉ denotes rounding to the nearest integer.
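Putting Eqs. (2.8) and (2.9) together, the forward pass of the quantizer can be sketched in a few lines of PyTorch; the function name, the 3-bit signed setting, and the step-size initialization below are illustrative assumptions. Combined with pass-through tricks like those shown above, s can be registered as a learnable parameter and trained jointly with the network weights.

```python
import torch

def lsq_quantize(v: torch.Tensor, s: torch.Tensor, Q_N: int, Q_P: int):
    """Quantize v with step size s into at most Q_N negative and Q_P positive levels.
    Returns v_bar, the integer-valued representation, and v_hat, the same values
    rescaled back to the range of v."""
    v_bar = torch.clamp(v / s, -Q_N, Q_P).round()   # Eq. (2.8)
    v_hat = v_bar * s                               # Eq. (2.9)
    return v_bar, v_hat

# Example: 3-bit signed weights, so Q_N = 4 and Q_P = 3; the step size is initialized
# from simple weight statistics (an illustrative choice, not a prescribed rule).
w = torch.randn(256)
s = 2 * w.abs().mean() / (3 ** 0.5)
w_bar, w_hat = lsq_quantize(w, s, Q_N=4, Q_P=3)
```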